Conversation

@heyihong (Contributor) commented Jul 14, 2025:

What changes were proposed in this pull request?

This PR optimizes the LiteralValueProtoConverter to reduce redundant type information in Spark Connect protocol buffers. The key changes include:

  1. Optimized type inference for arrays and maps: Modified the conversion logic to only include type information in the first element of arrays and the first key-value pair of maps, since subsequent elements can infer their types from the first element.

  2. Added needDataType parameter: Introduced a new parameter to control when type information is necessary, allowing the converter to skip redundant type information.

  3. Updated protobuf documentation: Enhanced comments in the protobuf definitions to clarify that only the first element needs to contain type information for inference.

  4. Improved test coverage: Added new test cases for complex nested structures including tuples and maps with array values.

Why are the changes needed?

The current implementation includes type information for every element in arrays and every key-value pair in maps, which is redundant and increases the size of protocol buffer messages. Since Spark Connect can infer types from the first element, including type information for subsequent elements is unnecessary and wastes bandwidth and processing time.
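
As an illustration, here is a minimal sketch of the encoding idea (hypothetical names such as `LiteralSketch` and `toArrayLiteral`, not the actual `LiteralValueProtoConverter` code):

```
// Hypothetical sketch: only the first element of an array literal carries
// explicit type information; later elements are emitted with
// needDataType = false and the receiver infers their type from the first.
sealed trait LiteralSketch
case class TypedLiteral(value: Any, dataType: String) extends LiteralSketch
case class UntypedLiteral(value: Any) extends LiteralSketch

def toLiteral(value: Any, dataType: String, needDataType: Boolean): LiteralSketch =
  if (needDataType) TypedLiteral(value, dataType) else UntypedLiteral(value)

def toArrayLiteral(values: Seq[Any], elementType: String): Seq[LiteralSketch] =
  values.zipWithIndex.map { case (v, i) =>
    // Only the first element (i == 0) needs the element type attached.
    toLiteral(v, elementType, needDataType = i == 0)
  }

// toArrayLiteral(Seq(1, 2, 3), "int") returns
//   Seq(TypedLiteral(1, "int"), UntypedLiteral(2), UntypedLiteral(3))
```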

Does this PR introduce any user-facing change?

No. This PR does not introduce any user-facing changes.
The change is backward compatible, and existing Spark Connect clients will continue to work unchanged.

How was this patch tested?

build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.4.5

@heyihong heyihong changed the title [SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Expression.Literal.Array optional [WIP][SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Array optional Jul 14, 2025
@heyihong heyihong force-pushed the SPARK-52449 branch 4 times, most recently from 62372ff to 2f4c24b on July 16, 2025 14:15
@heyihong heyihong changed the title [WIP][SPARK-52449][CONNECT] Make datatypes for Expression.Literal.Map/Array optional [SPARK-52449][CONNECT][PYTHON][SQL] Make datatypes for Expression.Literal.Map/Array optional Jul 16, 2025
@heyihong heyihong force-pushed the SPARK-52449 branch 4 times, most recently from 9e7737d to 71814d0 on July 16, 2025 15:28
@heyihong heyihong requested a review from hvanhovell July 16, 2025 19:00
@heyihong heyihong force-pushed the SPARK-52449 branch 3 times, most recently from 7e5b478 to da00208 on July 16, 2025 19:57
@HyukjinKwon (Member) commented:
cc @zhengruifeng too

case proto.Expression.Literal.LiteralTypeCase.BOOLEAN =>
  (literal.getBoolean.asInstanceOf[Object], classOf[Boolean])
case proto.Expression.Literal.LiteralTypeCase.ARRAY =>
  val array = literal.getArray
Reviewer (Contributor) commented on this diff:
cc @WeichenXu123 for the ml side

@heyihong heyihong changed the title [SPARK-52449][CONNECT][PYTHON][SQL] Make datatypes for Expression.Literal.Map/Array optional [SPARK-52449][CONNECT][PYTHON][ML] Make datatypes for Expression.Literal.Map/Array optional Jul 17, 2025
@heyihong heyihong requested a review from zhengruifeng July 17, 2025 13:41
@heyihong heyihong force-pushed the SPARK-52449 branch 4 times, most recently from c819bcb to c579c1c on July 21, 2025 13:56
@heyihong (Contributor, Author) commented Jul 21, 2025:

friendly ping @hvanhovell @HyukjinKwon @beliefer @zhengruifeng @WeichenXu123

Update: No need to review at the moment — I need to finish SPARK-52930 first.

@heyihong (Contributor, Author) commented Sep 1, 2025:
@@ -1,2 +1,2 @@
Project [id#0L, id#0L, 1 AS 1#0, null AS NULL#0, true AS true#0, 68 AS 68#0, 9872 AS 9872#0, -8726532 AS -8726532#0, 7834609328726532 AS 7834609328726532#0L, 2.718281828459045 AS 2.718281828459045#0, -0.8 AS -0.8#0, 89.97620 AS 89.97620#0, 89889.7667231 AS 89889.7667231#0, connect! AS connect!#0, T AS T#0, ABCDEFGHIJ AS ABCDEFGHIJ#0, 0x78797A7B7C7D7E7F808182838485868788898A8B8C8D8E AS X'78797A7B7C7D7E7F808182838485868788898A8B8C8D8E'#0, 0x0806 AS X'0806'#0, [8,6] AS ARRAY(8, 6)#0, null AS NULL#0, 2020-10-10 AS DATE '2020-10-10'#0, 8.997620 AS 8.997620#0, 2023-02-23 04:31:59.808 AS TIMESTAMP '2023-02-23 04:31:59.808'#0, 1969-12-31 16:00:12.345 AS TIMESTAMP '1969-12-31 16:00:12.345'#0, 2023-02-23 20:36:00 AS TIMESTAMP_NTZ '2023-02-23 20:36:00'#0, 2023-02-23 AS DATE '2023-02-23'#0, INTERVAL '0 00:03:20' DAY TO SECOND AS INTERVAL '0 00:03:20' DAY TO SECOND#0, INTERVAL '0-0' YEAR TO MONTH AS INTERVAL '0-0' YEAR TO MONTH#0, 23:59:59.999999999 AS TIME '23:59:59.999999999'#0, 2 months 20 days 0.0001 seconds AS INTERVAL '2 months 20 days 0.0001 seconds'#0, 1 AS 1#0, [1,2,3] AS ARRAY(1, 2, 3)#0, [1,2,3] AS ARRAY(1, 2, 3)#0, map(keys: [a,b], values: [1,2]) AS MAP('a', 1, 'b', 2)#0, [a,2,1.0] AS NAMED_STRUCT('_1', 'a', '_2', 2, '_3', 1.0D)#0, null AS NULL#0, [1] AS ARRAY(1)#0, map(keys: [1], values: [0]) AS MAP(1, 0)#0, map(keys: [1], values: [0]) AS MAP(1, 0)#0, map(keys: [1], values: [0]) AS MAP(1, 0)#0, [[1,2,3],[4,5,6],[7,8,9]] AS ARRAY(ARRAY(1, 2, 3), ARRAY(4, 5, 6), ARRAY(7, 8, 9))#0, [keys: [a,b], values: [1,2],keys: [a,b], values: [3,4],keys: [a,b], values: [5,6]] AS ARRAY(MAP('a', 1, 'b', 2), MAP('a', 3, 'b', 4), MAP('a', 5, 'b', 6))#0, map(keys: [1,2], values: [keys: [a,b], values: [1,2],keys: [a,b], values: [3,4]]) AS MAP(1, MAP('a', 1, 'b', 2), 2, MAP('a', 3, 'b', 4))#0, [[1,2,3],keys: [a,b], values: [1,2],[a,keys: [1,2], values: [a,b]]] AS NAMED_STRUCT('_1', ARRAY(1, 2, 3), '_2', MAP('a', 1, 'b', 2), '_3', NAMED_STRUCT('_1', 'a', '_2', MAP(1, 'a', 2, 'b')))#0]
Reviewer (Contributor) asked:
What is the diff here? Why did it change the analyzed plan?

@heyihong (Contributor, Author) replied Sep 4, 2025:
@zhengruifeng The diff is caused by an extra test case I added:

fn.typedLit(
  Seq(
      mutable.LinkedHashMap("a" -> Seq("1", "2"), "b" -> Seq("3", "4")),
      mutable.LinkedHashMap("a" -> Seq("5", "6"), "b" -> Seq("7", "8")),
      mutable.LinkedHashMap("a" -> Seq.empty[String], "b" -> Seq.empty[String])))
This produces the following analyzed-plan entry:
[keys: [a,b], values: [[1,2],[3,4]],keys: [a,b], values: [[5,6],[7,8]],keys: [a,b], values: [[],[]]] AS ARRAY(MAP('a', ARRAY('1', '2'), 'b', ARRAY('3', '4')), MAP('a', ARRAY('5', '6'), 'b', ARRAY('7', '8')), MAP('a', ARRAY(), 'b', ARRAY()))#0
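
To see why the empty value arrays in this test case matter, here is a minimal, hypothetical sketch of recursive type inference (`inferType`, `LitType`, and related names are made up for illustration and are not the Spark Connect converter):

```
// Hypothetical sketch of recursive literal type inference. Empty collections
// yield None: they cannot describe their own element type, which is why the
// first element / first key-value pair has to carry explicit type information
// for the rest to reuse.
sealed trait LitType
case object StringLitType extends LitType
case class ArrayLitType(element: LitType) extends LitType
case class MapLitType(key: LitType, value: LitType) extends LitType

def inferType(value: Any): Option[LitType] = value match {
  case _: String => Some(StringLitType)
  case s: Seq[_] => s.headOption.flatMap(inferType).map(ArrayLitType(_))
  case m: scala.collection.Map[_, _] =>
    m.headOption.flatMap { case (k, v) =>
      for (kt <- inferType(k); vt <- inferType(v)) yield MapLitType(kt, vt)
    }
  case _ => None
}

// inferType(Seq.empty[String]) == None, while the first map in the test case
// above infers MapLitType(StringLitType, ArrayLitType(StringLitType)), which
// the empty value arrays in the last map can then reuse.
```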

@zhengruifeng (Contributor) commented:
@heyihong please resolve the conflicts

@heyihong heyihong force-pushed the SPARK-52449 branch 3 times, most recently from e5e59aa to dc6ed7c on September 8, 2025 10:38
@zhengruifeng (Contributor) commented:
merged to master

dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 1, 2025
…th `4.1.0-preview2`

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview2`.

### Why are the changes needed?

There are many changes from Apache Spark 4.1.0.

- apache/spark#52342
- apache/spark#52256
- apache/spark#52271
- apache/spark#52242
- apache/spark#51473
- apache/spark#51653
- apache/spark#52072
- apache/spark#51561
- apache/spark#51563
- apache/spark#51489
- apache/spark#51507
- apache/spark#51462
- apache/spark#51464
- apache/spark#51442

These updates are needed to pick up the latest bug fixes and new messages when developing new features for `4.1.0-preview2`.

```
$ git clone -b v4.1.0-preview2 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

# Remove empty gRPC files
$ cd spark/connect

$ grep 'This file contained no services' *
catalog.grpc.swift:// This file contained no services.
commands.grpc.swift:// This file contained no services.
common.grpc.swift:// This file contained no services.
example_plugins.grpc.swift:// This file contained no services.
expressions.grpc.swift:// This file contained no services.
ml_common.grpc.swift:// This file contained no services.
ml.grpc.swift:// This file contained no services.
pipelines.grpc.swift:// This file contained no services.
relations.grpc.swift:// This file contained no services.
types.grpc.swift:// This file contained no services.

$ rm catalog.grpc.swift commands.grpc.swift common.grpc.swift example_plugins.grpc.swift expressions.grpc.swift ml_common.grpc.swift ml.grpc.swift pipelines.grpc.swift relations.grpc.swift types.grpc.swift
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #250 from dongjoon-hyun/SPARK-53777.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>